Abstract

An important issue in the study of cities is defining a metropolitan area, as different definitions 
affect conclusions regarding the statistical distribution of urban activity. A commonly employed 
method of defining a metropolitan area is the Metropolitan Statistical Area (MSA), which is based on
rules attempting to capture the notion of a city as a functional economic region and is constructed
using experience. The construction of MSAs is a time-consuming process and is typically done only
for a subset (a few hundred) of the most highly populated cities. Here, we introduce a new method
to designate metropolitan areas, denoted "City Clustering Algorithm" (CCA). The CCA is based 
on spatial distributions of the population at a fine geographic scale, defining a city beyond the 
scope of its administrative boundaries. We use the CCA to examine Gibrat's law of proportional 
growth, which postulates that the mean and standard deviation of the growth rate of cities are 
constant, independent of city size. We find that the mean growth rate of a cluster utilizing the CCA 
exhibits deviations from Gibrat's law, and that the standard deviation decreases as a power-law 
with respect to the city size. The CCA allows for the study of the underlying process leading to 
these deviations, which are shown to arise from the existence of long-range spatial correlations in 
population growth. These results have socio-political implications, for example for the location of 
new economic development in cities of varied size. 
I. INTRODUCTION



In recent years there has been considerable work on how to define cities and how the
different definitions affect the statistical distribution of urban activity. This is a
long-standing problem in spatial analysis of sources, referred to as the
'modifiable areal unit problem' or the 'ecological fallacy' [4], where different definitions of
spatial units based on administrative or governmental boundaries give rise to inconsistent
conclusions with respect to explanations and interpretations of data at different scales. The
conventional method of defining human agglomerations is through the MSAs,
which are subject to socio-economic factors. The MSA has been of indubitable importance
for the analysis of population growth, and is constructed manually case-by-case based on
subjective judgment (MSAs are defined starting from a highly populated central area and
adding its surrounding counties if they have social or economic ties).

In this report, we propose a new way to measure the extent of human agglomerations 
based on clustering techniques using a fine geographical grid, covering both urban and 
rural areas. In this view, "cities" represent clusters of population, i.e., adjacent populated 
geographical spaces. Our algorithm, the "city clustering algorithm" (CCA), allows for an 
automated and systematic way of building population clusters based on the geographical 
location of people. The CCA has one parameter (the cell size) that is useful for the study 
of human agglomerations at different length scales, similar to the level of aggregation in 
the context of social sciences. We show that the CCA allows for the study of the origin 
of statistical properties of population growth. We use the CCA to analyze the postulates 
of Gibrat's law of proportional growth applied to cities, which assumes that the mean and 
standard deviation of the growth rates of cities are constant. We show that population 
growth at a fine geographical scale for different urban and regional systems at country 
and continental levels (Great Britain, the USA, and Africa) deviates from Gibrat's law. 
We find that the mean and standard deviation of population growth rates decrease with 
population size, in some cases following a power-law behavior. We argue that the underlying 
demographic process leading to the deviations from Gibrat's law can be modeled from the 
existence of long-range spatial correlations in the growth of the population, which may arise 
from the concept that "development attracts further development." These results have 
implications for social policies, such as those pertaining to the location of new economic 
development in cities of different sizes. The present results imply that, on average, the
greatest growth rate occurs in the smallest places, where there is the greatest risk of failure
(larger fluctuations). A corollary is that the safest growth occurs in the largest places, which
are less likely to grow rapidly.

The analyzed data consist of the number of inhabitants, $n_i(t)$, in each cell $i$ of a fine
geographical grid at a given time, $t$. The cell size varies for each data set used in this study.
We consider three different geographic scales: on the smallest scale, the area of study is Great
Britain (GB: England, Scotland and Wales), a highly urbanized country with a population
of 58.7 million in 2007, and an area of 0.23 million km$^2$. The grid is composed of 5.75
million cells of 200m-by-200m [8]. At the intermediate scale, we study the USA (continental
USA without Alaska), a single country nearly continental in scale, with a population of
303 million in 2007, and an area of 7.44 million km$^2$. The grid contains 7.44 million cells
of approximately 1km-by-1km obtained from the US Census Bureau [9]. The datasets of
GB and USA are populated-places datasets, with population counts defined at points in a
grid. Since there could be some distortions in the true residential population involved at
the finest grid resolution, we perform our analysis by investigating the statistical properties
as a function of the grid size by coarse-graining the data as explained in Section IV A. At
the largest scale, we analyze the continent of Africa, composed of 53 countries with a total
population of 933 million in 2007, and an area of 30.34 million km$^2$. These data are gridded
at a coarser resolution into 0.50 million cells of approximately 7.74km-by-7.74km [10]. More
detailed information about these datasets is found in Section IV A (all the datasets studied
in this paper are available at http://lev.ccny.cuny.edu/~hmakse/cities/city_data.zip).



II. RESULTS 



Figure 1A illustrates the operation of the CCA. In order to identify urban clusters, we require
connected cells to have nonzero population. We start by selecting an arbitrary populated 
cell (final results are independent of the choice of the initial cell). Iteratively, we then grow a 
cluster by adding nearest neighbors of the boundary cells with a population strictly greater 
than zero, until all neighbors of the boundary are unpopulated. We repeat this process until 
all populated cells have been assigned to a cluster. This technique was introduced to model 
forest fire dynamics and is termed the "burning algorithm," since one can think of each 
populated cell as a burning tree.
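The burning procedure described above can be sketched as a breadth-first flood fill over the grid. This is only a minimal illustration, not the authors' implementation: the function names, the toy grid, and the choice of four nearest neighbors (up, down, left, right) are assumptions made here for concreteness.

```python
from collections import deque


def city_clustering(grid):
    """Group populated cells of a 2D grid (list of lists of counts) into clusters.

    Returns a dict mapping cluster id -> list of (row, col) cells.
    Assumption: cells are neighbors if they share an edge (4-connectivity).
    """
    rows, cols = len(grid), len(grid[0])
    burned = [[False] * cols for _ in range(rows)]
    clusters = {}
    for r0 in range(rows):
        for c0 in range(cols):
            if grid[r0][c0] <= 0 or burned[r0][c0]:
                continue
            # Start a new cluster by "burning" an unburned populated cell.
            cid = len(clusters)
            clusters[cid] = []
            queue = deque([(r0, c0)])
            burned[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                clusters[cid].append((r, c))
                # Burn every populated, not yet burned nearest neighbor.
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and grid[nr][nc] > 0 and not burned[nr][nc]):
                        burned[nr][nc] = True
                        queue.append((nr, nc))
    return clusters


def cluster_populations(grid, clusters):
    """Population of each cluster: the sum of the populations of its cells."""
    return {cid: sum(grid[r][c] for r, c in cells)
            for cid, cells in clusters.items()}


# Hypothetical 4x4 grid of population counts (0 means unpopulated).
toy_grid = [
    [5, 3, 0, 0],
    [0, 2, 0, 7],
    [0, 0, 0, 1],
    [4, 0, 0, 0],
]
toy_clusters = city_clustering(toy_grid)
toy_populations = cluster_populations(toy_grid, toy_clusters)
```

On the toy grid the sketch finds three clusters. Because every populated cell is burned exactly once, the resulting partition does not depend on which populated cell is picked first, in line with the remark in the text.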

The population $S_i(t)$ of cluster $i$ at time $t$ is the sum of the populations $n_j^{(i)}(t)$
of each cell $j$ within it, $S_i(t) = \sum_{j=1}^{N_i} n_j^{(i)}(t)$, where $N_i$ is the number of cells
in the cluster. Results of the CCA are shown in Fig. 1B, representing the urban
cluster surrounding the City of London (red cluster overlaying a satellite image, see
http://lev.ccny.cuny.edu/~hmakse/cities/london.gif for an animated image of Fig. 1B).
Figure 1C depicts all the clusters of GB, indicating the large variability in their population and
size.

A feature of the CCA is that it allows the analysis of the population clusters at different 
length scales by coarse-graining the grid and applying the CCA to the coarse-grained dataset 
(see Section IV A for details on coarse-graining the data). At larger scales, disconnected
areas around the edge of a cluster could be added into the cluster. This is justified when, 
for example, a town is divided by a wide highway or a river. 

Tables I and II in Supporting Information (SI) Section I show a detailed comparison
between the urban clusters obtained with the CCA applied to the USA in 1990, and the
results obtained from the analysis of MSAs from the US Census Bureau used in previous
studies of population growth [5], [6], [7]. We observe that the MSAs considered in Ref. [5] are
similar to the clusters obtained with the CCA with a cell size of 4km-by-4km or 8km-by-8km.
In particular, the population sizes of the clusters have the same order of magnitude
as the MSAs. On the other hand, for large cities the MSAs from the data of Ref. [6] seem
to be mostly comparable to our results for cell sizes of 2km-by-2km or 4km-by-4km.

Use of the CCA permits a systematic study of cluster dynamics. For instance, clusters
may expand or contract, merge or split between two considered times, as illustrated
in Fig. 2. We quantify these processes by measuring the probability distribution of the
temporal changes in the clusters for the data of GB. We find that when the cell size is
2.2km-by-2.2km, 84% of the clusters evolve from 1981 to 1991 following the first three cases
presented in Fig. 2 (no change, expansion or reduction), 6% of the clusters merge from two
clusters into one in 1991, and 3% of the clusters split into two clusters.

Next, we apply the CCA to study the dynamics of population growth by investigating
Gibrat's law, which postulates that the mean and standard deviation of growth rates are
constant. The conventional method is to assume that the populations
of a given city or cluster $i$, at times $t_0$ and $t_1 > t_0$, are related by

$$S_1 = R(S_0)\, S_0, \tag{1}$$

where $S_0 \equiv S_i(t_0) = \sum_{j=1}^{N_i} n_j^{(i)}(t_0)$ and $S_1 \equiv S_i(t_1) = \sum_{j=1}^{N_i} n_j^{(i)}(t_1)$ are the initial and final
populations of cluster $i$, respectively, and $R(S_0)$ is the positive growth factor, which varies
from cluster to cluster. Following the literature in population dynamics, we
define the population growth rate of a cluster as $r(S_0) \equiv \ln R(S_0) = \ln(S_1/S_0)$, and study
the dependence of the mean value of the growth rate, $\langle r(S_0)\rangle$, and the standard deviation,
$\sigma(S_0) = \sqrt{\langle r(S_0)^2\rangle - \langle r(S_0)\rangle^2}$, on the initial population, $S_0$. The averages $\langle r(S_0)\rangle$ and $\sigma(S_0)$
are calculated applying nonparametric techniques [13], [14] (see Section IV B for details). To
obtain the population growth rate of clusters we take into account that not all clusters
occupy the same area between $t_0$ and $t_1$, according to the cases discussed in Fig. 2. The
figure shows how to calculate the growth rate $r(S_0)$ in each case.

We analyze the population growth in the USA from $t_0 = 1990$ to $t_1 = 2000$ [9]. We apply
the CCA to identify the clusters in the data of 1990 and calculate their growth rates by
comparing them to the population of the same clusters in 2000 when the data are gridded
with a cell size of 2000m-by-2000m. We calculate the annual growth rates by dividing $r$ by
the time interval $t_1 - t_0$.
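As a worked illustration of the annual growth rate defined above, the following minimal sketch computes $r = \ln(S_1/S_0)/(t_1 - t_0)$ for two hypothetical clusters. The populations are invented for illustration and are not taken from the datasets; the split case follows the rule of Fig. 2, where the final population is the sum of the two resulting clusters.

```python
import math


def annual_growth_rate(s0, s1, t0, t1):
    """Annual log growth rate r = ln(S1 / S0) / (t1 - t0)."""
    if s0 <= 0 or s1 <= 0:
        raise ValueError("populations must be positive")
    if t1 <= t0:
        raise ValueError("t1 must be later than t0")
    return math.log(s1 / s0) / (t1 - t0)


# Hypothetical cluster: 100,000 inhabitants in 1990 and 110,000 in 2000.
r_simple = annual_growth_rate(100_000, 110_000, 1990, 2000)

# Hypothetical split: one cluster at t0 becomes two at t1, so S1 is their sum.
r_split = annual_growth_rate(50_000, 30_000 + 25_000, 1990, 2000)
```

For the first cluster the rate is $\ln(1.1)/10 \approx 0.0095$ per year, of the same order as the mean rates reported for the USA.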

Figure 3A shows a nonparametric regression with bootstrapped 95% confidence bands
[14] of the growth rate of the USA, $\langle r(S_0)\rangle$ (see Section IV B for details). We find that the
growth rate diminishes from $\langle r(S_0)\rangle \approx 0.012 \pm 0.004$ (error includes the confidence bands)
for populations below 10^ inhabitants to $\langle r(S_0)\rangle \approx 0.002 \pm 0.002$ for the largest populations
around $S_0 \sim$ 10^. We may argue that the mean growth rate deviates from Gibrat's law
beyond the confidence bands. While it is difficult to fit the data to a single function for the
entire range, the data show a decrease with $S_0$ approximately following a power-law in the
tail for populations larger than 10^. An attempt to fit the data with a power-law yields the
following scaling in the tail:

$$\langle r(S_0)\rangle \sim S_0^{-\alpha}, \tag{2}$$

where $\alpha$ is the mean growth exponent, which takes a value $\alpha_{\rm USA} = 0.28 \pm 0.08$ from Ordinary
Least Squares (OLS) analysis [15] (see Section IV B for details on OLS and on the estimation
of the exponent error).
Figure 3B shows the dependence of the standard deviation $\sigma(S_0)$ on the initial population
$S_0$. On average, fluctuations in the growth rate of large cities are smaller than for small
cities, in contrast to Gibrat's law. This result can be approximated over many orders of
magnitude by the power-law

$$\sigma(S_0) \sim S_0^{-\beta}, \tag{3}$$

where $\beta$ is the standard deviation exponent. We carry out an OLS regression analysis and
find that $\beta_{\rm USA} = 0.20 \pm 0.06$. The presence of a power-law implies that fluctuations in the
growth process are statistically self-similar at different scales, for populations ranging from
~1000 to ~10 million according to Fig. 3B.

Figure 4 shows the analysis of the growth rate of the population clusters of GB from
gridded databases [8] with a cell size of 2.2km-by-2.2km at $t_0 = 1981$ and $t_1 = 1991$. The
average growth rate depicted in Fig. 4A shows large fluctuations as a function of $S_0$,
especially for smaller populations. However, a slight decrease with population seems evident
from rates around $\langle r\rangle \approx 0.008 \pm 0.001$ with $S_0 \sim$ 10^ dropping to zero or even negative values
for the largest populations, $S_0 \sim$ 10^. We find that 3556 clusters with population around
$S_0 =$ 10^ exhibit negative growth rates as well. Thus, the mean rates are plotted on a
semi-logarithmic scale in Fig. 4A. When considering intermediate populations ranging from
$S_0 = 3000$ to $S_0 = 3 \times$ 10^, the data seem to follow approximately a power-law
with $\alpha_{\rm GB} = 0.17 \pm 0.05$ from OLS regression analysis, as shown in the inset of Fig. 4A.
Figure 4B shows the standard deviation for GB, $\sigma(S_0)$, exhibiting deviations from Gibrat's
law, with a tendency to decrease with population according to Eq. (3) and a standard
deviation exponent $\beta_{\rm GB} = 0.27 \pm 0.04$, obtained with the OLS technique.

The CCA allows for a study of the growth rates as a function of the scale of observation, by
changing the size of the grid. We find (SI Section II) that the data for GB are approximately
invariant under coarse-graining the grid at different levels for both the mean and standard
deviation. When the data of the USA are aggregated spatially from cell size 2000m to
8000m, the scaling of the mean rates crosses over to a flat behavior closer to Gibrat's law.
At the scale of 8000m the mean is approximately constant (with fluctuations). However,
we find that, at this scale, all cities in the northeastern USA spanning from Boston to
Washington D.C. form a single cluster. Despite these differences, the scaling of the standard
deviation for the USA holds approximately invariant even up to the large scale of observation
of 8000m.
Next, we analyze the population growth in Africa during the period 1960 to 1990 [10].
In this case the population data are based on a larger cell size, so we evaluate the data
cell by cell (without the application of the CCA). Despite the differences in the economic
and urban development of Africa, Great Britain and the USA, we find that the mean and
standard deviation of the growth rate in Africa display similar scaling as found for the USA
and GB. In Fig. 5A we show the results for the growth rate in Africa when the grid is
coarse-grained with a cell size of 77km-by-77km. We find a decrease of the growth rate from
$\langle r(S_0)\rangle \approx 0.1$ to $\langle r(S_0)\rangle \approx 0.01$ between populations $S_0 \sim$ 10^ and $S_0 \sim$ 10^, respectively.
All populations have positive growth rates. A log-log plot of the mean rates shown in
Fig. 5A reveals a power-law scaling $\langle r(S_0)\rangle \sim S_0^{-\alpha_{\rm Af}}$ with $\alpha_{\rm Af} = 0.21 \pm 0.05$ from OLS
regression analysis. The standard deviation (Fig. 5B) satisfies Eq. (3) with a standard
deviation exponent $\beta_{\rm Af} = 0.19 \pm 0.04$.

The CCA allows for a study of the origin of the
observed behavior of the growth rates by examining the dynamics and spatial correlations
of the population of cells. To this end, we first generate a surrogate dataset by
swapping randomly chosen pairs of populated cells, $n_j^{(i)}(t_0)$ and $n_k^{(i)}(t_0)$, at time $t_0$. This swapping
process preserves the probability distribution of $n_j^{(i)}$, but destroys any spatial correlations
among the population cells. Figure 4C shows the results of the randomization of the GB
dataset, indicating power-law scaling in the tail of $\sigma(S_0)$ with standard deviation exponent
$\beta_{\rm rand} = 1/2$. This result can be interpreted in terms of the uncorrelated nature of the
randomized dataset (SI Section III). We consider that the population of each cell $j$ increases
by a random amount $\delta_j$ with mean value $\bar{\delta}$ and variance $\langle(\delta - \bar{\delta})^2\rangle = \Delta^2$, and that $r \ll 1$;
then $n_j^{(i)}(t_1) = n_j^{(i)}(t_0) + \delta_j$. Therefore, the population of a cluster at time $t_1$ can be written
as

$$S_1 = S_0 + \sum_{j=1}^{N_i} \delta_j. \tag{4}$$

It can be shown that (SI Section III):

$$\langle r^2\rangle - \langle r\rangle^2 = \frac{1}{S_0^2} \sum_{j}^{N_i} \sum_{k}^{N_i} \langle(\delta_j - \bar{\delta})(\delta_k - \bar{\delta})\rangle. \tag{5}$$

Randomly shuffling population cells destroys the correlations, leading to $\langle(\delta_j - \bar{\delta})(\delta_k - \bar{\delta})\rangle = \Delta^2 \delta_{jk}$
(where $\delta_{jk}$ is the Kronecker delta), which implies $\beta_{\rm rand} = 1/2$ [16] (see SI
Section III).
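The step from uncorrelated increments to $\beta_{\rm rand} = 1/2$ can be sketched as follows. This is a heuristic outline rather than the calculation of SI Section III; it assumes $r \ll 1$, so that $r \approx \sum_j \delta_j / S_0$, and that after shuffling the cluster population is proportional to its number of cells, $S_0 \approx \bar{n} N_i$, where $\bar{n}$ (a symbol introduced here) is the mean population per populated cell.

```latex
\begin{align}
\sigma^2(S_0) &= \frac{1}{S_0^2}\sum_{j=1}^{N_i}\sum_{k=1}^{N_i}\Delta^2\,\delta_{jk}
               = \frac{N_i\,\Delta^2}{S_0^2}, \\
S_0 \approx \bar{n}\,N_i \;&\Rightarrow\;
\sigma(S_0) \approx \frac{\Delta}{\sqrt{\bar{n}\,S_0}} \sim S_0^{-1/2}.
\end{align}
```

Only the $N_i$ diagonal terms of the double sum survive, which is the usual square-root law for a sum of independent contributions.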

The fact that $\beta$ lies below the random exponent ($\beta_{\rm rand} = 1/2$) for all the analyzed data
suggests that the dynamics of the population cells display spatial correlations, which are
eliminated in the random surrogate data. The cells are not occupied randomly but spatial
correlations arise, since when the population in one cell increases, the probability of growth
in an adjacent cell also increases. That is, development attracts further development, an
idea that has been used to model the spatial distribution of urban patterns [17]. Indeed, these
ideas are related to the study of the origin of power-laws in complex systems [19].

When we analyze the populated cells, we indeed find that spatial correlations in the
incremental population of the cells, $\delta_j$, are asymptotically of a scale-invariant form characterized
by a correlation exponent $\gamma$,

$$\langle(\delta_j - \bar{\delta})(\delta_k - \bar{\delta})\rangle \sim \frac{\Delta^2}{|\vec{x}_j - \vec{x}_k|^{\gamma}}, \tag{6}$$

where $\vec{x}_j$ is the location of cell $j$. For GB we find $\gamma = 0.93 \pm 0.08$ (see Fig. 4D). In SI
Section III we show that power-law correlations in the fluctuations at the cell level, Eq. (6),
lead to a standard deviation exponent $\beta = \gamma/4$. For $\gamma = 2$, the dimension of the substrate,
we recover $\beta_{\rm rand} = 1/2$ (larger values of $\gamma$ result in the same $\beta$, since when $\gamma > 2$ correlations
become irrelevant). If $\gamma = 0$, the standard deviation of the population growth rates has
no dependence on the population size ($\beta = 0$), as stated by Gibrat's law.
In the case of GB, $\gamma = 0.93 \pm 0.08$
gives $\beta = 0.23 \pm 0.02$, approximately consistent with the measured value $\beta_{\rm GB} = 0.27 \pm 0.04$
within the error bars. This observation suggests that the underlying demographic process
leading to the scaling in the standard deviation can be modeled as arising from the long-range
correlated growth of population cells.
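A heuristic way to see why power-law correlations give $\beta = \gamma/4$ is sketched below. It is not the derivation of SI Section III; it assumes a compact cluster of $N_i$ cells on a two-dimensional substrate with linear size $R \sim N_i^{1/2}$ (in units of the cell size), $S_0 \sim N_i$, and $0 \le \gamma < 2$, and it replaces the double sum over cell pairs by an integral over the separation $\rho$.

```latex
\begin{align}
\sigma^2(S_0) &\sim \frac{1}{S_0^2}\sum_{j=1}^{N_i}\sum_{k\neq j}^{N_i}
   \frac{\Delta^2}{|\vec{x}_j-\vec{x}_k|^{\gamma}}
   \;\sim\; \frac{\Delta^2}{S_0^2}\,N_i\int_{1}^{R}\rho^{-\gamma}\,2\pi\rho\,d\rho
   \;\sim\; \frac{\Delta^2}{S_0^2}\,N_i\,R^{\,2-\gamma}, \\
R \sim N_i^{1/2},\; S_0 \sim N_i
   \;&\Rightarrow\; \sigma^2(S_0) \sim S_0^{-\gamma/2}
   \;\Rightarrow\; \sigma(S_0) \sim S_0^{-\gamma/4},
   \qquad \beta = \gamma/4, \\
\gamma = 0.93 \pm 0.08 \;&\Rightarrow\; \beta = 0.2325 \pm 0.02 \approx 0.23 \pm 0.02 .
\end{align}
```

The limiting cases agree with the text: $\gamma = 0$ gives $\beta = 0$ (Gibrat's law), and $\gamma \to 2$ gives $\beta \to 1/2$, the uncorrelated value.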

III. DISCUSSION 

Our results suggest the existence of scale-invariant growth mechanisms acting at different
geographical scales. Furthermore, Eq. (3) is similar to what is found for the growth of
firms and other macroeconomic indicators [16], [20]. Thus, our results support the existence
of an underlying link between the fluctuation dynamics of population growth and various
economic indicators, implying considerable unevenness in economic development in different
population sizes. City growth is driven by many processes of which population growth and
migration is only one. Our study captures only the growth of population but not economic
growth per se. Many cities grow economically while losing population and thus the processes
we imply are those that influence a changing population. Our assumption is that population
change is an indicator of city growth or decline and therefore we have based our studies on
population clustering techniques. Alternatively, the MSAs provide a set of rules that try
to capture the idea of a city as a functional economic region.

The results we obtain show scale-invariant properties which we have modelled using
long-range spatial correlations between the population of cells. That is, strong development in
an area attracts more development in its neighborhood and much beyond. A key finding is
that small places exhibit larger fluctuations than large places. The implications for locating
activity in different places are that there is a greater probability of larger growth in small
places, but also a greater probability of larger decline. Opportunity must be weighed against
the risk of failure.

One may take these ideas to a higher level of abstraction to study cell-to-cell flows
(migration, commuting, etc.) gridded at different levels. As a consequence one may define
population clusters, or MSAs, in terms of functional linkages between neighboring cells. In
addition one may relax some conditions imposed in the CCA. Here we consider a cell to be
part of a cluster only if its population is strictly greater than 0. In SI Section V we relax
this condition and study the robustness of the CCA when only cells with a population above
a higher threshold (for instance, 5 and 20) are allowed into clusters, and find that even though
small clusters present a slight deviation, the overall behavior of the growth rate and standard
deviation is conserved.



IV. MATERIALS AND METHODS 
A. Information on the datasets 



The datasets analyzed in this paper were obtained from the websites
http://census.ac.uk, http://www.esri.com/ and http://na.unep.net/datasets/datalist.php
for GB, USA and Africa, respectively, and can be downloaded from
http://lev.ccny.cuny.edu/~hmakse/cities/city_data.zip

The datasets consist of a list of populations at specific coordinates at two time steps $t_0$
and $t_1$. A graphical representation of the data can be seen in Fig. 1C for GB, where each
point represents a data point directly extracted from the dataset.

To perform the CCA at different scales we coarse-grain the datasets. For this purpose, 
we overlay a grid on the corresponding map (USA, GB, or Africa) with the desired cell size 
(for example, 2km-by-2km or 4km-by-4km for the USA). Then, the population of each cell 
is calculated as the sum of the populations of points (obtained from the original dataset) 
that fall into this cell. 
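The coarse-graining step just described can be sketched as a simple binning of points into square cells. This is a minimal illustration under stated assumptions: the input is taken to be a list of (x, y, population) tuples with coordinates in meters, and the function name and sample points are hypothetical rather than part of the released datasets.

```python
from collections import defaultdict


def coarse_grain(points, cell_size):
    """Sum point populations into square cells of side `cell_size`.

    points: iterable of (x, y, population); x and y share units with cell_size.
    Returns a dict mapping (column, row) cell index -> total population.
    """
    cells = defaultdict(float)
    for x, y, population in points:
        # Floor division assigns each point to the cell that contains it.
        key = (int(x // cell_size), int(y // cell_size))
        cells[key] += population
    return dict(cells)


# Hypothetical populated places, coordinates in meters.
sample_points = [(100, 100, 10), (1900, 300, 5), (2100, 300, 7), (3900, 3900, 2)]
cells_2km = coarse_grain(sample_points, 2000)
cells_4km = coarse_grain(sample_points, 4000)
```

With 2km cells the four points fall into three cells; with 4km cells they all fall into one, which illustrates how nearby but disconnected populated places become part of the same cell, and hence the same cluster, at coarser scales.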

Table I shows information on the datasets and results on USA, GB and Africa for the
cell size used in the main text as well as some of the exponents obtained in our analysis. 



TABLE I: Characteristics of datasets and summary of results

Data    Number     $t_0$  $t_1$  Average      Cell size       Number of  $\alpha$     $\beta$
        of cells                 growth rate                  clusters
USA     1.86 mill  1990   2000   0.9%         2km-by-2km      30,210     0.28 ± 0.08  0.20 ± 0.06
GB      0.10 mill  1981   1991   0.3%         2.2km-by-2.2km  10,178     0.17 ± 0.05  0.27 ± 0.04
Africa  2,216      1960   1990   4%           77km-by-77km    3,988      0.21 ± 0.05  0.19 ± 0.04



B. Calculation of $\langle r(S_0)\rangle$ and $\sigma(S_0)$ and methodology

The average growth rate, $\langle r(S_0)\rangle = \langle \ln(S_1/S_0)\rangle$, and the standard deviation,
$\sigma(S_0) = \sqrt{\langle r(S_0)^2\rangle - \langle r(S_0)\rangle^2}$, are defined as follows. If we call $P(r|S_0)$ the conditional probability
distribution of finding a cluster with growth rate $r(S_0)$ given the initial population
$S_0$, then we can obtain $\langle r(S_0)\rangle$ and $\sigma(S_0)$ through

$$\langle r(S_0)\rangle = \int r\, P(r|S_0)\, dr, \tag{7}$$

and

$$\langle r(S_0)^2\rangle = \int r^2\, P(r|S_0)\, dr. \tag{8}$$

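Once $r(S_0)$ and $\sigma(S_0)$ are calculated for each cluster, we perform a nonparametric
regression analysis [13], [14], a technique broadly used in the literature of population dynamics.
The idea is to provide an estimate for the relationship between the growth rate and $S_0$ and
between the standard deviation and $S_0$. Following the methods explained in Ref. [14] we
apply the Nadaraya-Watson method to calculate an estimate for the growth rate, $\hat{r}(S_0)$,
with

$$\hat{r}(S_0) = \frac{\sum_{\rm all\ clusters\ i} K_h(S_0 - S_i(t_0))\, r_i(S_0)}{\sum_{\rm all\ clusters\ i} K_h(S_0 - S_i(t_0))}, \tag{9}$$

and an estimate for the standard deviation, $\hat{\sigma}(S_0)$, with

$$\hat{\sigma}(S_0) = \sqrt{\frac{\sum_{\rm all\ clusters\ i} K_h(S_0 - S_i(t_0))\, \left(r_i(S_0) - \hat{r}(S_0)\right)^2}{\sum_{\rm all\ clusters\ i} K_h(S_0 - S_i(t_0))}}, \tag{10}$$

where $S_i(t_0)$ is the population of cluster $i$ at time $t_0$ (as defined in the main text), $r_i(S_0)$ is
the growth rate of cluster $i$ and $K_h(S_0 - S_i(t_0))$ is a gaussian kernel of the form

$$K_h(S_0 - S_i(t_0)) = \exp\left[-\frac{(\ln S_0 - \ln S_i(t_0))^2}{2h^2}\right], \qquad h = 0.5. \tag{11}$$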

Finally, we compute the 95% confidence bands (calculated from 500 random samples with
replacement) to estimate the amount of statistical error in our results [13]. The bootstrapping
technique was applied by sampling as many data points as the number of clusters and
performing the nonparametric regression on the sampled data. By performing 500 realizations
of the bootstrapping algorithm and extracting the so-called $\alpha'/2$ quantile ($\alpha'$ is not related to
the growth rate exponent) we obtain the 95% confidence bands.
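The kernel regression and bootstrap described above can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function names and toy data are hypothetical, the Gaussian kernel is assumed to act on the logarithm of the population with bandwidth $h = 0.5$, and the band is taken as a simple percentile interval over the bootstrap replicates.

```python
import math
import random


def gaussian_kernel(s0, si, h=0.5):
    """Gaussian kernel on log-populations (assumed form), bandwidth h."""
    u = (math.log(s0) - math.log(si)) / h
    return math.exp(-0.5 * u * u)


def nw_estimate(s0, pops, rates, h=0.5):
    """Nadaraya-Watson mean and kernel-weighted standard deviation at S0."""
    weights = [gaussian_kernel(s0, si, h) for si in pops]
    total = sum(weights)
    mean = sum(w * r for w, r in zip(weights, rates)) / total
    var = sum(w * (r - mean) ** 2 for w, r in zip(weights, rates)) / total
    return mean, math.sqrt(var)


def bootstrap_band(s0, pops, rates, h=0.5, n_boot=500, level=0.95, seed=0):
    """Percentile confidence band for the mean estimate at S0."""
    rng = random.Random(seed)
    n = len(pops)
    means = []
    for _ in range(n_boot):
        # Resample as many clusters as there are in the data, with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        mean, _ = nw_estimate(s0, [pops[i] for i in idx],
                              [rates[i] for i in idx], h)
        means.append(mean)
    means.sort()
    lower = means[int(n_boot * (1 - level) / 2)]
    upper = means[int(n_boot * (1 + level) / 2) - 1]
    return lower, upper


# Hypothetical clusters: initial populations and annual growth rates.
example_pops = [1e3, 3e3, 1e4, 3e4, 1e5, 3e5, 1e6]
example_rates = [0.020, 0.017, 0.015, 0.012, 0.010, 0.007, 0.005]
example_mean, example_sd = nw_estimate(1e4, example_pops, example_rates)
example_band = bootstrap_band(1e4, example_pops, example_rates)
```

Evaluating `nw_estimate` and `bootstrap_band` on a grid of $S_0$ values gives a regression curve with its confidence band.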

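To obtain the exponents $\alpha$ and $\beta$ of the power-law scalings for $\langle r(S_0)\rangle$ and $\sigma(S_0)$,
respectively, we perform an OLS regression analysis [15]. More specifically, to obtain the exponent
$\beta$ from Eq. (3), we first linearize the data by considering the logarithm of the independent
and dependent variables so that Eq. (3) becomes $\ln \sigma(S_0) \sim -\beta \ln S_0$. Then, we apply a
linear Ordinary Least Squares regression that leads to the exponent

$$\beta = -\frac{N_c \sum_{i=1}^{N_c} \left[\ln S_i(t_0)\, \ln \sigma(S_i(t_0))\right] - \sum_{i=1}^{N_c} \ln S_i(t_0) \sum_{i=1}^{N_c} \ln \sigma(S_i(t_0))}{N_c \sum_{i=1}^{N_c} (\ln S_i(t_0))^2 - \left(\sum_{i=1}^{N_c} \ln S_i(t_0)\right)^2}, \tag{12}$$

where $N_c$ is the number of clusters found using the CCA. Analogously, we obtain the
exponent $\alpha$ by linearizing $\langle |r(S_0)| \rangle$ and calculating

$$\alpha = -\frac{N_c \sum_{i=1}^{N_c} \left[\ln S_i(t_0)\, \ln \langle |r(S_i(t_0))| \rangle\right] - \sum_{i=1}^{N_c} \ln S_i(t_0) \sum_{i=1}^{N_c} \ln \langle |r(S_i(t_0))| \rangle}{N_c \sum_{i=1}^{N_c} (\ln S_i(t_0))^2 - \left(\sum_{i=1}^{N_c} \ln S_i(t_0)\right)^2}. \tag{13}$$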
Next we compute the 95% confidence interval for the exponents $\alpha$ and $\beta$. For this we
follow the book of Montgomery and Peck [15]. The 95% confidence interval for $\beta$ is given
by

$$\pm\, t_{0.025,\,N_c-2}\; {\rm se}, \tag{14}$$

where $t_{\alpha'/2,\,N_c-2}$ is the t-distribution with parameters $\alpha'/2$ and $N_c - 2$, and se is the standard
error of the exponent, defined as

$${\rm se} = \sqrt{\frac{SS_E}{(N_c - 2)\, S_{xx}}}, \tag{15}$$

where $SS_E$ is the residual and $S_{xx}$ is the variance of $S_0$.

Finally, we express the value of the exponent in terms of the 95% confidence intervals as

$$\beta \pm t_{0.025,\,N_c-2}\; {\rm se}. \tag{16}$$
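The OLS estimate of a power-law exponent and its confidence half-width can be sketched as below. This is a minimal illustration with hypothetical names and synthetic data. It assumes the standard textbook definitions for simple linear regression (residual sum of squares divided by $N_c - 2$, and $S_{xx}$ as the sum of squared deviations of $\ln S_0$ from its mean), and the critical value `t_crit` is supplied by the user rather than computed.

```python
import math


def ols_loglog(x, y):
    """OLS fit of ln y = a + b ln x. Returns (slope b, standard error of b)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    sum_x, sum_y = sum(lx), sum(ly)
    sum_xy = sum(a * b for a, b in zip(lx, ly))
    sum_xx = sum(a * a for a in lx)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    # Residual sum of squares and spread of the regressor ln x.
    ss_e = sum((b - intercept - slope * a) ** 2 for a, b in zip(lx, ly))
    s_xx = sum_xx - sum_x ** 2 / n
    se = math.sqrt(ss_e / ((n - 2) * s_xx))
    return slope, se


# Synthetic data following sigma = 2 * S0^(-0.2) exactly (illustration only).
s0_values = [1e3, 1e4, 1e5, 1e6, 1e7]
sigma_values = [2.0 * s ** -0.2 for s in s0_values]

slope, se = ols_loglog(s0_values, sigma_values)
beta_hat = -slope  # sigma ~ S0^(-beta), so the exponent is minus the slope

t_crit = 1.96  # hypothetical large-sample value; use t_{0.025, Nc-2} in practice
half_width = t_crit * se
```

Because the synthetic data lie exactly on a power law, the fit returns `beta_hat` equal to 0.2 up to rounding and a standard error that is essentially zero; with real data the half-width gives error bars of the kind quoted in Table I.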
FIG. 1: (A) Sketch illustrating the CCA applied to a sample of gridded population data. In the top
left panel, cells are colored in blue if they are populated ($n_j^{(i)}(t) > 0$); otherwise, if $n_j^{(i)}(t) = 0$, they
are in white. In the top right panel we initialize the CCA by selecting a populated cell and burning
it (red cell). Then, we burn the populated neighbors of the red cell as shown in the lower left panel.
We keep growing the cluster by iteratively burning neighbors of the red cells until all neighboring
cells are unpopulated, as shown in the lower right panel. Next, we pick another unburned populated
cell and repeat the algorithm until all populated cells are assigned to a cluster. The population
$S_i(t)$ of cluster $i$ at time $t$ is then $S_i(t) = \sum_{j=1}^{N_i} n_j^{(i)}(t)$. (B) Cluster identified with the CCA in the
London area (red) overlaying a corresponding satellite image (extracted from maps.google.com). 
The greenery corresponds to vegetation, and thus approximately indicates unoccupied areas. For 
example, Richmond Park can be found as a vegetation area in the south-west. The areas in the 
east along the Thames River correspond mainly to industrial districts and in the west the London 
Heathrow Airport, also not populated. The yellow line in the center represents the administrative 
boundary of the City of London, demonstrating the difference with the urban cluster found with the 
CCA. The pink clusters surrounding the major red cluster are smaller conglomerates not connected 
to London. The figure shows that an analysis based on the City of London captures only a partial 
area of the real urban agglomeration. (C) Result of the CCA applied to all of GB showing the 
large variability in the population distribution. The color bar (in logarithmic scale) indicates the 
population of each urban cluster. 

FIG. 2: Illustration of possible changes in cluster shapes. In each case we show how the growth
rate is computed. In the first case, there is no areal modification in the cluster between $t_0$ and $t_1$.
In the second, the cluster expands. In the third, the cluster reduces its area. In the fourth, one
cluster divides into two and therefore we consider the population at $t_1$ to be $S_1 = S_1' + S_1''$. In the
fifth case, two clusters merge to form one at $t_1$. In this case we consider the population at $t_0$ to be
$S_0 = S_0' + S_0''$.
FIG. 3: Results for the USA using a cell size of 2000m-by-2000m. (A) Mean annual growth rate for
population clusters in the USA versus initial population of the clusters. The straight dashed line
shows a power-law fit with $\alpha_{\rm USA} = 0.28 \pm 0.08$ as determined using OLS regression. (B) Standard
deviation of the growth rate for the USA. The straight dashed line corresponds to a power-law fit
using OLS regression with $\beta_{\rm USA} = 0.20 \pm 0.06$.

FIG. 4: Results for Great Britain using a cell size of 2.2km-by-2.2km. (A) Mean annual growth
rate of population clusters in Great Britain versus the initial cluster population. The inset shows
a double logarithmic plot of the growth rate in the intermediate range of populations, $3000 <
S_0 < 3 \times$ 10^. A power-law fit using OLS leads to an exponent $\alpha_{\rm GB} = 0.17 \pm 0.05$ for this range.
(B) Double logarithmic plot of the standard deviation of the annual growth rates of population
clusters in Great Britain versus the initial cluster population. The straight line corresponds to a
power-law fit using OLS with an exponent $\beta_{\rm GB} = 0.27 \pm 0.04$, according to Eq. (3). (C) Scaling
of the standard deviation in cluster population obtained from the randomized surrogate dataset
of GB by randomly swapping the cells. The data show an exponent $\beta_{\rm rand} = 1/2$ in the tail. The
deviations for small $S_0$ are discussed in SI Section IV, where we test these results by generating
random populations. (D) Long-range spatial correlations in the population growth of cells for GB
according to Eq. (6). The straight line corresponds to an exponent $\gamma = 0.93 \pm 0.08$.

FIG. 5: Results for Africa using a cell size of 77km-by-77km. (A) Mean growth rate of clusters in
Africa versus the initial size of population $S_0$. The straight dashed line shows a power-law fit with
exponent $\alpha_{\rm Af} = 0.21 \pm 0.05$, obtained using OLS regression. (B) Standard deviation of the growth
rate in Africa. The straight line corresponds to a power-law fit using OLS providing the exponent
$\beta_{\rm Af} = 0.19 \pm 0.04$.
SUPPORTING INFORMATION



Laws of Population Growth 

Hernan D. Rozenfeld, Diego Rybski, Jose S. Andrade Jr.,
Michael Batty, H. Eugene Stanley, and Hernan A. Makse



As supplementary materials we provide the following: In Section V we present tables
with details on our results using the CCA and results presented in previous papers to allow
for comparison between the different approaches. In Section VI we study the stability of the
scaling found in the text under a change of scale in the cell size. In Section VII we detail the
calculations relating spatial correlations in the population growth to $\sigma(S_0)$, namely
the relation $\beta = \gamma/4$. In Section VIII we describe the random surrogate dataset used to
further test our results. In Section IX we further test the robustness of the CCA by proposing
a small variation in the algorithm.



V. CLUSTERS AT DIFFERENT SCALES AND COMPARISON WITH 
METROPOLITAN STATISTICAL AREAS 

In this section, Tables S1 and S2 allow for a detailed comparison of urban clusters obtained 
with the CCA applied to the USA in 1990, and the populations of MSAs from the US Census 
Bureau used in previous studies of population growth. 

We can see that the MSAs presented by Eeckhout (2004) typically correspond to our 
clusters using cell sizes of 4km and 8km. For example, for the New York City region 
Eeckhout's data are well approximated by a cell size of 4km, but Los Angeles is better 
approximated when using a cell size of 8km. On the other hand, the Dobkins-Ioannides (2000) 
data are better described by cell sizes of 2km or 4km. For instance, Chicago is well described 
by a cell size of 4km and Los Angeles is better described by a cell size of 2km. 

An interesting remark is that the population of Los Angeles when using cell sizes of 2km, 
4km and 8km does not vary as much as that of New York. This could be caused by the 
fact that major cities in the northeast of the USA are closer to each other than large cities in 
the southwest, which may be attributed to land or geographical constraints. 
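
The remark above can be checked directly against the cluster populations listed in Table S2. The following is a minimal sketch in Python that uses only those tabulated values; the dictionary layout and the helper name `variation_factor` are illustrative choices, not part of the original analysis.

```python
# Cluster populations (USA, 1990) taken from Table S2, keyed by cell size in km.
populations = {
    "New York": {2: 12_511_237, 4: 17_064_816, 8: 41_817_858},
    "Los Angeles": {2: 9_582_507, 4: 10_878_034, 8: 13_304_233},
}


def variation_factor(by_cell_size):
    """Ratio of the largest to the smallest cluster population across cell sizes."""
    values = list(by_cell_size.values())
    return max(values) / min(values)


factors = {city: variation_factor(p) for city, p in populations.items()}
for city, factor in factors.items():
    print(f"{city}: population varies by a factor of {factor:.2f} between 2km and 8km cells")
```

For these entries the New York cluster grows by more than a factor of three between 2km and 8km cells, while the Los Angeles cluster grows by less than a factor of 1.5.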
It is important to relate the results of Table S2 to an ecological fallacy. As the cell size 
is increased, the population of a cluster also increases, as expected, because the cluster now 
covers a larger area. This is not a direct manifestation of an ecological fallacy, which would 
appear if the statistical results (growth rate vs. S or standard deviation vs. S) gave different 
results as the cell size increases. In Fig. 1 and Fig. 2 in the SI Section VI we observe that 
the growth rate and standard deviation for the USA and GB follow the same form, except 
for the case of the growth rate in the USA, in which different cell sizes show deviations from 
each other. The latter may be indicative of an ecological fallacy. In this case, it is not 
obvious what cell size is the "correct" one. We consider this point (the possibility to choose 
the cell size) to be a feature of the CCA, since one may appropriately pick the cell size 
according to the specific problem one is studying. 

Table S1: Top 10 largest MSAs of the USA in 1990 from previous analyses of 
population growth. 

      Dobkins-Ioannides                Eeckhout
Rank  MSA                  Population  MSA                                                 Population
1     NYC NY206             9,372,000  NYC-North NJ-Long Is., NY-NJ-CT-PA                  19,549,649
2     Los Angeles CA172     8,863,000  Los Angeles-Riverside-Orange County, CA             14,531,529
3     Chicago IL59          7,333,000  Chicago-Gary-Kenosha, IL-IN-WI                       8,239,820
4     Philadelphia PA228    4,857,000  Washington-Baltimore, DC-MD-VA-WV                    6,727,050
5     Detroit MI80          4,382,000  San Francisco-Oakland-San Jose, CA                   6,253,311
6     Washington DC312      3,924,000  Philadelphia-Wilmington-Atlantic City, PA-NJ-DE-MD   5,892,937
7     San Francisco CA266   3,687,000  Boston-Worcester-Lawrence, MA-NH-ME-CT               5,455,403
8     Houston TX129         3,494,000  Detroit-Ann Arbor-Flint, MI                          5,187,171
9     Atlanta GA19          2,834,000  Dallas-Fort Worth, TX                                4,037,282
10    Boston MA39           2,800,000  Houston-Galveston-Brazoria, TX                       3,731,131
Table S2: Top 10 largest clusters of the USA in 1990 from our analysis for 
different cell sizes. The city names are the major cities that belong to the clusters and 
were picked to show the areal extension of the cluster. 

      Cell = 1km                 Cell = 2km
Rank  Cluster        Population  Cluster                            Population
1     NYC             7,012,989  NYC-Long Is., Newark, Jersey City  12,511,237
2     Chicago         2,312,783  Los Angeles, Long Beach             9,582,507
3     Los Angeles     1,411,791  Chicago, Rockford                   4,836,529
4     Philadelphia    1,282,834  Philadelphia, Wilmington            3,151,704
5     Boston            759,024  Detroit                             2,906,453
6     Newark            581,048  San Francisco, San Jose             2,601,639
7     San Francisco     507,300  Washington, Alexandria, Bethesda    2,059,421
8     Washington        504,068  Phoenix                             1,556,077
9     Jersey City       438,591  Boston, Lowell, Quincy              1,498,208
10    Baltimore         437,413  Miami                               1,465,490

      Cell = 4km                                           Cell = 8km
Rank  Cluster                                  Population  Cluster                                            Population
1     NYC-Long Is., N. NJ-Newark, Jersey City  17,064,816  NYC-Long Is., North NJ, Philadelphia, D.C.-Boston  41,817,858
2     Los Angeles, Long Beach, Pomona          10,878,034  Los Angeles, San Clemente, Riverside               13,304,233
3     Chicago, Gary, Rockford                   7,230,404  Chicago, Gary, Rockford, Milwaukee                  9,288,345
4     Washington, Baltimore, Springfield        5,316,890  San Francisco, Santa Cruz, Brentwood                5,736,479
5     Philadelphia, Trenton, Wilmington         4,935,734  Detroit, Ann Arbor, Monroe, Sarnia                  4,442,723
6     San Francisco, San Jose, Concord          4,766,960  Miami, Port St. Lucie                               4,000,432
7     Detroit, Waterford, Canton                3,722,778  Dallas, Fort Worth                                  3,536,186
8     Miami, W. Palm Beach                      3,719,773  Houston                                             3,425,647
9     Dallas, Fort Worth                        3,134,233  Cleveland, Canton                                   3,233,341
10    Boston, Brockton, Nashua                  3,064,925  Pittsburgh, Youngstown, Morgantown                  3,214,661
FIG. 6: Sensitivity of the results under coarse-graining of the data for GB. (A) Average growth 
rate and (B) standard deviation for GB using the clustering algorithm for different cell sizes. The 
dashed line represents the OLS regression estimate for the exponents (A) $\alpha_{\rm GB} = 0.17$ and (B) 
$\beta_{\rm GB} = 0.27$ obtained in the main text. For clarity we do not show the confidence bands. 

VI. SCALING UNDER COARSE-GRAINING 

In this section we test the sensitivity of our results to a coarse-graining of the data. We 
analyze the average growth rate $\langle r(S_0) \rangle$ and the standard deviation $\sigma(S_0)$ for GB and the 
USA by coarse-graining the data sets at different levels. 
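
As an illustration of what coarse-graining a gridded population dataset can look like, the sketch below sums non-overlapping blocks of cells into larger cells. It assumes the data are held as a rectangular grid of cell populations; the function name `coarse_grain` and the treatment of incomplete edge blocks (missing cells count as unpopulated) are assumptions of this example, not a description of the procedure used for the GB and USA data.

```python
def coarse_grain(grid, factor):
    """Sum non-overlapping factor-by-factor blocks of a 2D population grid.

    Cells missing at the right or bottom edge are treated as unpopulated.
    """
    rows, cols = len(grid), len(grid[0])
    out_rows = -(-rows // factor)  # ceiling division
    out_cols = -(-cols // factor)
    coarse = [[0] * out_cols for _ in range(out_rows)]
    for r in range(rows):
        for c in range(cols):
            coarse[r // factor][c // factor] += grid[r][c]
    return coarse


# Toy 3-by-4 grid of cell populations, coarse-grained by a factor of 2.
fine = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
]
coarse = coarse_grain(fine, 2)
print(coarse)  # [[10, 18], [17, 21]]
```

The total population is conserved by construction, so any change in the cluster statistics comes only from the change in spatial resolution.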

In Fig. 6A we observe that although the results are not identical for all coarse-grainings, 
they are statistically similar, showing a slight decay in the growth rate. Moreover, we see 
that cities of size $S_0 \sim 10^{[\ldots]}$ and $S_0 \sim 10^{[\ldots]}$ still exhibit a tendency to have negative growth 
rates for all levels of coarse-graining, as explained in the main text. In the case of the USA 
(Fig. 7A) there is a crossover to a flat behavior at a cell size of 8000m, although at this scale 
all the northeast USA becomes a large cluster of 41 million inhabitants. On the other hand, 
Figs. 6B and 7B show that the scaling of Eq. (3) in the main text, $\sigma(S_0) \sim S_0^{-\beta}$, still holds when 
using the coarse-grained datasets on both GB and the USA. 
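
The quantities compared across coarse-grainings can be estimated along the following lines. This is a minimal sketch, assuming the growth rate is defined as r = ln(S_1/S_0) and that clusters are grouped in logarithmic bins of S_0; the bin width, the helper names and the synthetic test data (generated with a known exponent of 0.25) are all choices made for this example only.

```python
import math
import random
import statistics


def ols_slope(xs, ys):
    """Ordinary least squares slope of ys against xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def binned_growth_stats(s0, s1, bins_per_decade=2):
    """Return (bin center, mean of r, std of r) for logarithmic bins of S0."""
    bins = {}
    for initial, final in zip(s0, s1):
        key = math.floor(math.log10(initial) * bins_per_decade)
        bins.setdefault(key, []).append(math.log(final / initial))
    rows = []
    for key in sorted(bins):
        rates = bins[key]
        if len(rates) >= 2:  # a standard deviation needs at least two clusters
            center = 10 ** ((key + 0.5) / bins_per_decade)
            rows.append((center, statistics.fmean(rates), statistics.stdev(rates)))
    return rows


def fit_beta(rows):
    """Estimate beta in sigma(S0) ~ S0**(-beta) by OLS on log-log axes."""
    xs = [math.log10(center) for center, _mean, _std in rows]
    ys = [math.log10(std) for _center, _mean, std in rows]
    return -ols_slope(xs, ys)


# Synthetic clusters whose growth-rate fluctuations decay as S0**(-0.25).
rng = random.Random(0)
s0 = [10 ** rng.uniform(2, 6) for _ in range(40_000)]
s1 = [s * math.exp(rng.gauss(0.0, s ** -0.25)) for s in s0]
beta_hat = fit_beta(binned_growth_stats(s0, s1))
print(f"estimated beta = {beta_hat:.3f}")
```

Running the same estimate on each coarse-grained dataset and comparing the fitted exponents is one way to quantify the statement that the scaling still holds.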

FIG. 7: Study of results under coarse-graining of the data for the USA. (A) Average growth rate 
and (B) standard deviation for the USA using the clustering algorithm for different cell sizes. The 
dashed line represents the OLS regression estimate for the exponents (A) $\alpha_{\rm USA} = 0.28$ and (B) 
$\beta_{\rm USA} = 0.20$ obtained in the main text. For clarity we do not show the confidence bands. 

VII. CORRELATIONS 

In this section we elaborate on the calculations leading to the relation between Gibrat's 
law and the spatial correlations in the cell population. We first show that when the 
population cells are randomly shuffled (destroying any spatial correlations between the growth 
rates of the cells), the standard deviation of the growth rate becomes $\sigma(S_0) \sim S_0^{-\beta_{\rm rand}}$, where 
$\beta_{\rm rand} = 1/2$ [16]. Then, we show that long-range spatial correlations in the population of 
the cells lead to the relation $\beta = \gamma/4$ as stated at the end of Section II in the main text. 

Assuming that the population growth rate is small ($r \ll 1$), we can write $R = e^r \approx 1 + r$. 
Replacing $R = 1 + r$ in Eq. (1) in the main text we obtain 

$$S_1 = S_0 + S_0 r. \qquad (17)$$



We define the standard deviation of the populations $S_1$ as $\sigma_1$, which is a function of $S_0$: 

$$\sigma_1(S_0) = \sqrt{\langle S_1^2 \rangle - \langle S_1 \rangle^2}. \qquad (18)$$



This quantity is easier to relate to the spatial correlations of the cells than the standard 
deviation $\sigma(S_0)$ of the growth rates $r$. Then, since $\langle S_1 \rangle = S_0 + S_0 \langle r \rangle$ and 
$\langle S_1^2 \rangle = S_0^2 + 2 S_0^2 \langle r \rangle + S_0^2 \langle r^2 \rangle$, we obtain, 

$$\sigma_1(S_0) = S_0\, \sigma(S_0), \qquad (19)$$

where $\sigma(S_0) = \sqrt{\langle r^2 \rangle - \langle r \rangle^2}$ as defined in the main text. Therefore, using Eq. (3) in the 
main text, 

$$\sigma_1(S_0) \sim S_0^{1-\beta}. \qquad (20)$$
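
For readers who want the intermediate algebra, the step from Eqs. (17) and (18) to Eqs. (19) and (20) can be written out as follows. This is a worked sketch under the same small-$r$ approximation, with all averages taken at fixed $S_0$; it adds no new assumptions beyond the scaling $\sigma(S_0) \sim S_0^{-\beta}$ of Eq. (3) in the main text.

```latex
\begin{align*}
\langle S_1 \rangle   &= S_0 \left( 1 + \langle r \rangle \right), \\
\langle S_1^2 \rangle &= S_0^2 \left( 1 + 2 \langle r \rangle + \langle r^2 \rangle \right), \\
\sigma_1^2(S_0) &= \langle S_1^2 \rangle - \langle S_1 \rangle^2
                 = S_0^2 \left( \langle r^2 \rangle - \langle r \rangle^2 \right)
                 = S_0^2 \, \sigma^2(S_0), \\
\sigma_1(S_0)   &= S_0 \, \sigma(S_0) \sim S_0 \cdot S_0^{-\beta} = S_0^{1-\beta}.
\end{align*}
```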

As stated in the main text, the total population of a cluster at time $t_0$ is the sum of the 
populations of each cell, $S_0 = \sum_{j=1}^{N_i} n_j^{(i)}$, where $N_i$ is the number of cells in cluster $i$. The 
population of a cluster at time $t_1$ can be written as 

$$S_1 = S_0 + \sum_j \delta_j, \qquad (21)$$

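where $\delta_j$ is the increment in the population of cell $j$ from time $t_0$ to $t_1$ (notice that $\delta_j$ can 
be negative). Therefore, the standard deviation $\sigma_1(S_0)$ is 

$$[\sigma_1(S_0)]^2 = \sum_{j,k} \langle \delta_j \delta_k \rangle - \Big\langle \sum_j \delta_j \Big\rangle^2 = \sum_{j,k} \langle (\delta_j - \bar{\delta})(\delta_k - \bar{\delta}) \rangle. \qquad (22)$$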

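After the process of randomization explained in Section II of the main text, the correlations 
between the increments of population in each cell are destroyed. Thus, 

$$\langle (\delta_j - \bar{\delta})(\delta_k - \bar{\delta}) \rangle = \Delta^2 \delta_{jk}, \qquad (23)$$

where $\Delta^2 = \overline{\delta^2} - \bar{\delta}^2$ and $\delta_{jk}$ is the Kronecker delta. Replacing in Eq. (22) and since 
$\langle n \rangle = (1/N_i) \sum_j^{N_i} n_j = S_0/N_i$, we obtain 

$$[\sigma_1(S_0)]^2 = N_i \Delta^2 \sim S_0. \qquad (24)$$

Comparing with Eq. (20) we obtain $\beta_{\rm rand} = 1/2$ for this uncorrelated case. 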
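
The uncorrelated result of Eq. (24) is easy to check numerically. The sketch below draws independent cell increments, sums them over clusters of different numbers of cells, and fits the growth of the standard deviation of the sum. The Gaussian increments, the value of `delta_std` and the cluster sizes are arbitrary choices for this illustration; since $S_0$ is proportional to the number of cells, the fitted slope of $\sigma_1$ against the number of cells plays the role of $1 - \beta$ in Eq. (20).

```python
import math
import random
import statistics


def ols_slope(xs, ys):
    """Ordinary least squares slope of ys against xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def sigma1_uncorrelated(n_cells, delta_std, n_realizations, rng):
    """Standard deviation of the summed increments of n_cells independent cells."""
    totals = [
        sum(rng.gauss(0.0, delta_std) for _cell in range(n_cells))
        for _realization in range(n_realizations)
    ]
    return statistics.stdev(totals)


rng = random.Random(1)
delta_std = 3.0  # plays the role of Delta in Eq. (23)
cell_counts = [10, 100, 1000]
sigma1 = [sigma1_uncorrelated(n, delta_std, 500, rng) for n in cell_counts]

slope = ols_slope(
    [math.log10(n) for n in cell_counts],
    [math.log10(s) for s in sigma1],
)
beta_rand = 1.0 - slope  # from sigma_1 ~ S0**(1 - beta) with S0 proportional to N_i
print(f"slope = {slope:.3f}, beta_rand = {beta_rand:.3f}")
```

Within sampling error each measured $\sigma_1$ should lie close to $\Delta \sqrt{N_i}$, which is the content of Eq. (24).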

Let us assume that the correlation of the population increments $\delta_j$ decays as a power-law 
of the distance between cells, indicating long-range scale-free correlations. Thus, 
asymptotically 

$$\langle (\delta_j - \bar{\delta})(\delta_k - \bar{\delta}) \rangle \sim \frac{\Delta^2}{|x_j - x_k|^{\gamma}}, \qquad (25)$$

where $x_j$ denotes the position of the cell $j$ and $\gamma$ is the correlation exponent (for $|x_j - x_k| \to 0$, 
the correlations $\langle (\delta_j - \bar{\delta})(\delta_k - \bar{\delta}) \rangle$ tend to a constant). For large clusters, we can approximate 
the double sum in Eq. (22) by an integral. Then, assuming that the shape of the clusters 
can be approximated by disks of radius $r_c$, for $\gamma < 2$ we obtain 

$$[\sigma_1(S_0)]^2 \sim \frac{N_i \Delta^2}{a^2} \int_0^{r_c} \frac{r\, dr}{r^{\gamma}} \sim \frac{N_i \Delta^2}{a^2}\, r_c^{2-\gamma}, \qquad (26)$$

where $a^2$ is the area of each cell and $r_c$ the radius of the cluster. Since $r_c^2 \sim N_i a^2$, we finally 
obtain, 

$$[\sigma_1(S_0)]^2 \sim N_i^{2 - \gamma/2}. \qquad (27)$$

Using $S_0 = N_i \langle n \rangle$ and Eq. (20) we arrive at, 

$$\beta = \frac{\gamma}{4}. \qquad (28)$$

Equation (25) shows that Gibrat's law is recovered when the correlation of the population 
increments is a constant, independent of the positions of the cells; that is, when all the 
population cells are increased equally. In other words, if $\gamma = 0$, the standard deviation of 
the population growth rates has no dependence on the population size ($\beta = 0$), as stated by 
Gibrat's law. The random case is obtained for $\gamma = d$, where $d = 2$ is the dimensionality of the 
substrate. In this case $\beta_{\rm rand} = 1/2$. For $\gamma > 2$, the correlations become irrelevant 
and we still find the uncorrelated case $\beta_{\rm rand} = 1/2$. For intermediate values $0 < \gamma < 2$ we 
obtain $0 < \beta = \gamma/4 < 1/2$. 

VIII. RANDOM SURROGATE DATASET 

In this section we elaborate on the randomization procedure used to understand the role 
of correlations in population growth. 

Figure 4C in the main text shows the standard deviation $\sigma(S_0)$ when the population 
of each cluster is randomized, breaking any spatial correlation in population growth. For 
clusters with a large population, $\sigma(S_0)$ follows a power-law with exponent $\beta_{\rm rand} = 1/2$, 
and for small $S_0$, $\sigma(S_0)$ presents deviations from the power-law function, as seen in Fig. 4C, 
with a smaller standard deviation than the prediction of the random case. This deviation is 
caused by the fact that the population of a cluster is bound to be positive: a cluster with a 
small population $S_0$ cannot decrease its population by a large number, since it would lead 
to negative values of $S_1$. This produces an upper bound in the fluctuations of the growth rate 
for small $S_0$ and results in smaller values of $\sigma(S_0)$ than expected (below the scaling with 
exponent $\beta_{\rm rand} = 1/2$). 

To support this argument, we carry out simulations using the clusters of GB, where the 
population $n_j(t_0)$ of each cell $j$ is replaced with random numbers following an exponential 
distribution with probability $P(n_j) \sim e^{-n_j/n_0}$. The decay-constant, $n_0 = 150$, is extracted 
from the data of GB to mimic the original distribution. This is done through a direct measure 
of $P(n_j)$ from the GB dataset and fitting the data using OLS regression analysis. We obtain 
the population $n_j(t_1) = n_j(t_0) + \delta_j$ of cell $j$ at time $t_1$ by picking random numbers for the 
population increments $\delta_j$ following a uniform distribution in the range $-q \cdot 150 < \delta_j < q \cdot 150$. 
Here $q$ determines the variance of the increments. Since the population cannot be negative 
we impose the additional condition $n_j(t_1) \geq 0$. Figure 8 shows the results of the standard 
deviation $\sigma(S_0)$ for four different $q$-values for this uncorrelated model. We find that the tail 
of $\sigma(S_0)$ reproduces the uncorrelated exponent $\beta_{\rm rand} = 1/2$. For small $S_0$ we find that the 
standard deviation levels off to an approximately constant value as in the surrogate data of 
Fig. 4C. The crossover from an approximately constant $\sigma(S_0)$ to a power-law moves to smaller 
values of the population $S_0$ as the standard deviation in the $\delta_j$ is smaller (smaller value of $q$). 
Such behavior can be understood since the condition $n_j^{(i)}(t_1) \geq 0$ imposes a lower "wall" in 
the random walk specified by $n_j^{(i)}(t_1) = n_j^{(i)}(t_0) + \delta_j$. As the initial population gets smaller, 
the walker "feels" the presence of the wall and the fluctuations decrease accordingly, thus 
explaining the deviations from the power-law with exponent $\beta_{\rm rand} = 1/2$ for small population 
values. Therefore, as the value of $q$ decreases, the small population plateau disappears as 
observed in Fig. 8. 
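
A simplified version of this uncorrelated model can be sketched as follows. Several ingredients are assumptions of the example rather than the procedure used for Fig. 8: clusters are taken to be groups with a fixed number of cells instead of the actual GB clusters, the condition $n_j(t_1) \geq 0$ is imposed by clipping negative values to zero, the growth rate is taken as $r = \ln(S_1/S_0)$, and clusters whose final population is zero are discarded. The sketch only checks the large-population tail; the small-population behavior depends on how the wall is imposed and is not asserted here.

```python
import math
import random
import statistics


def ols_slope(xs, ys):
    """Ordinary least squares slope of ys against xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def surrogate_sigma(n_cells, q, n_clusters, rng, n0=150.0):
    """Mean S0 and std of r = ln(S1/S0) for clusters of n_cells random cells."""
    s0_values, rates = [], []
    for _cluster in range(n_clusters):
        s0 = s1 = 0.0
        for _cell in range(n_cells):
            n_t0 = rng.expovariate(1.0 / n0)        # P(n) ~ exp(-n / n0)
            delta = rng.uniform(-q * n0, q * n0)    # uniform increment
            s0 += n_t0
            s1 += max(n_t0 + delta, 0.0)            # assumed form of the "wall"
        if s0 > 0 and s1 > 0:
            s0_values.append(s0)
            rates.append(math.log(s1 / s0))
    return statistics.fmean(s0_values), statistics.stdev(rates)


def tail_exponent(curve, n_tail=3):
    """Fit sigma ~ S0**(-beta) on the n_tail largest cluster sizes."""
    xs = [math.log10(s0) for s0, _sigma in curve[-n_tail:]]
    ys = [math.log10(sigma) for _s0, sigma in curve[-n_tail:]]
    return -ols_slope(xs, ys)


rng = random.Random(2)
sizes = [1, 4, 16, 64, 256, 1024]
curves = {q: [surrogate_sigma(n, q, 300, rng) for n in sizes] for q in (0.5, 1.0)}
beta_tail = {q: tail_exponent(curve) for q, curve in curves.items()}
for q, beta in beta_tail.items():
    print(f"q = {q}: tail exponent = {beta:.2f}")
```

Grouping by the number of cells is a shortcut: for large clusters $S_0 \approx N_i n_0$, so it approximates conditioning on $S_0$, which is what the figure reports.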

IX. A VARIATION OF THE CCA 

In this section we study a variation of the CCA. In the main text we stop growing a cluster 
when all boundary cells are unpopulated, that is, have a population of exactly 
0. In other words, clusters are composed of cells with a population strictly greater than 0. It 
is important to analyze whether this stopping criterion can be relaxed to include only cells which 
have a population larger than a given threshold. In Fig. 9A and Fig. 9B we show the results 
for the population growth rate and standard deviation, respectively, in GB when the cell 
size is 2.2km-by-2.2km (as in the main text) but including only cells with a population strictly 
larger than 5 and 20. 

Although for small population clusters we observe a slight variation in the growth rate 
and in the standard deviation, the results show that the thresholds do not influence the 
global statistics when compared to the plots in the main text. 
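
The stopping criterion and its thresholded variant can be illustrated with a small cluster-growth routine. This is a sketch of the idea, not the code used for the analysis: the breadth-first growth, the choice of four nearest neighbors (an assumption, selectable through the hypothetical `neighbors` argument) and the toy grid are all illustrative. A cell joins a cluster only if its population is strictly larger than `threshold`, so `threshold=0` corresponds to the criterion of the main text.

```python
from collections import deque


def cca_clusters(grid, threshold=0, neighbors=4):
    """Group cells with population > threshold into connected clusters.

    Returns a grid of cluster labels (-1 for excluded cells) and the list
    of cluster populations, in the order the clusters are discovered.
    """
    rows, cols = len(grid), len(grid[0])
    if neighbors == 4:
        steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    else:  # 8-connected alternative
        steps = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    labels = [[-1] * cols for _ in range(rows)]
    populations = []
    for r0 in range(rows):
        for c0 in range(cols):
            if grid[r0][c0] <= threshold or labels[r0][c0] != -1:
                continue
            cluster_id = len(populations)
            labels[r0][c0] = cluster_id
            total = 0
            queue = deque([(r0, c0)])
            while queue:  # grow until every boundary cell is at or below threshold
                r, c = queue.popleft()
                total += grid[r][c]
                for dr, dc in steps:
                    rr, cc = r + dr, c + dc
                    if (0 <= rr < rows and 0 <= cc < cols
                            and labels[rr][cc] == -1 and grid[rr][cc] > threshold):
                        labels[rr][cc] = cluster_id
                        queue.append((rr, cc))
            populations.append(total)
    return labels, populations


grid = [
    [9, 3, 0, 0],
    [0, 4, 0, 7],
    [0, 0, 0, 30],
    [6, 0, 0, 0],
]
_, pops_threshold_0 = cca_clusters(grid, threshold=0)
_, pops_threshold_5 = cca_clusters(grid, threshold=5)
print(pops_threshold_0)  # [16, 37, 6]
print(pops_threshold_5)  # [9, 37, 6]
```

In this toy grid, raising the threshold only trims the sparsely populated fringe of the first cluster and leaves the larger cluster unchanged, which is the kind of effect the comparison in Fig. 9 is meant to probe.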
FIG. 8: Standard deviation $\sigma(S_0)$ for the random data set as explained in the SI Section VIII. 
The results for $\sigma(S_0)$ are rescaled to collapse the power-law tails with exponent $\beta_{\rm rand} = 1/2$ and to 
emphasize the deviations from this function for small values of $S_0$. The larger the parameter $q$, the 
larger the deviations from the power-law at lower $S_0$. In other words, the crossover to the power-law 
tail appears at larger $S_0$ as $q$ increases. 
FIG. 9: Sensitivity of the results under a change in the stopping criterion in the CCA. (A) Average 
growth rate for GB with a population threshold of 5 (green line) and 20 (black dashed line) and 
(B) standard deviation for GB with a population threshold of 5 (green line) and 20 (black dashed 
line). For clarity we do not show the confidence bands. 